Analysis of the Curse Log

Introduction

With this worksheet, you will learn the first steps with Jupyter, Python, pandas and matplotlib using a practical example: We execute an analysis of the futuristic data set "curse log" – a log of time-tracked swearwords events gathered with the microchip implant "FUTURE 2000" from several humans.

We managed to get the log of uttered curses of some users. We also have some user profile data that we can combine with the curse log to find out, who the cursers are.

Tasks

We want to find out
- which curses are the most popular
- at what hour of the day most curses are made.
- which people are in the TOP 10 cursers' list
- if men curse more often than woman
- the favorite curse for each job

As a starting point, we have a log from several users recorded in a file that lists the time stamp, the curse word and the id of the user's profile each curse:

timestamp,curse,profile_id
2132-12-31 14:47:43,The A-word,0
2132-12-31 13:13:56,The F-word,0
2132-12-31 13:03:05,The S-word,0
2132-12-31 12:30:34,The F-word,0
2132-12-31 12:29:02,The S-word,0

Hint: This data is a data set from the future. But since time machines aren't invented yet, we are working with a generated / synthetic dataset based on a real data set from an other domain.

Let's get to know the tools we use!

Jupyter

Jupyter offers us code and documentation in executable cells.

Code execution

select the next cell (mouse click or arrow keys).
execute the cell with a Ctrl+Enter.
move the cell again with Shift+Enter. What's the difference between the output of results?



In [1]:

    
"Hello World"









    Out[1]:





'Hello World'

Create new cell

if it hasn't happened yet, select this cell.
enter command mode, selectable with ESC key.
create a new cell after this text with the b key.
change the cell type to "Markdown" with key m.
switch to edit mode with Enter (note the color to the left of the cell, which turns green instead of blue).
write a text, which you then "execute" with Ctrl + Enter.

This is a text

Python

We look at very basic functions:

variable assignments
value range accesses
method calls

Assign text to a variable

assign the text value "Hello World" to the variable text by using the syntax <variable> = <value>.
type the variable text in the next line and execute the cell.



In [2]:

    
text = "Hello World"
text









    Out[2]:





'Hello World'

Access values

access the first letter in text with [0].



In [3]:

    
text[0]









    Out[3]:





'H'

Select last value

access the last letter in text with [-1].



In [4]:

    
text[-1]









    Out[4]:





'd'

Select value ranges

access a range of text with the slice [2:5].



In [5]:

    
text[2:4]









    Out[5]:





'll'

Auto completion

append a . to text and look at the functions with the Tab key.
execute the method upper() (Tip: Type a u in the function overview).



In [6]:

    
text.upper()









    Out[6]:





'HELLO WORLD'

Interactive documentation

select the split function of text.
press Shift+Tab.
press Shift+Tab twice in quick succession.
press Shift+Tab four times in quick succession (and then ESC to hide)
split the text in text with split exactly once (parameter maxsplit) apart by using the l ("L") as separator (parameter sep).



In [7]:

    
text.split("l",maxsplit=1)









    Out[7]:





['He', 'lo World']

Pandas

Import data

Load a module

import the module pandas with import <module> as <abbreviation> as abbreviated pd
in the next line, attached a ? to pd and execute the cell to get some information about pd.



In [8]:

    
import pandas as pd
pd?

Import file

use the read_csv method to read the data from the file curse_log.csv.
write the result into the variable log.
display the first five entries (= curses) in log with the head() method.



In [9]:

    
log = pd.read_csv("../dataset/curse_log.gz")
log.head()









    Out[9]:







  
    
      
      timestamp
      curse
      profile_id
    
  
  
    
      0
      2132-12-31 14:47:43
      The A-word
      0
    
    
      1
      2132-12-31 13:13:56
      The F-word
      0
    
    
      2
      2132-12-31 13:03:05
      The S-word
      0
    
    
      3
      2132-12-31 12:30:34
      The F-word
      0
    
    
      4
      2132-12-31 12:29:02
      The S-word
      0

Getting to know the data

call info() on log.



In [10]:

    
log.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 723212 entries, 0 to 723211
Data columns (total 3 columns):
timestamp     723212 non-null object
curse         723212 non-null object
profile_id    723212 non-null int64
dtypes: int64(1), object(2)
memory usage: 16.6+ MB

We see that log is

a DataFrame that consists of
- three columns (so-called Series) timestamp, curse and profile_id

Aggregate data

Top curses

sum up the number of curses in the column curse in log with the method value_counts().
save the result in the variable top_curses.
list the result in top_curses.

Note: In this tutorial, we access Series directly with the .<Series> notation (e. g. log.curse). This works only if the names of the Series are different from the provided functions of a Series. E. g. it doesn't work, when you try to access a Series named count, because count() is a function of a Series. Here, you have to use the ['<Series name>'] notation (e.g. log['count']. When in doubt, always use the ['<Series name>'] notation (but which disables the auto-completion feature)



In [11]:

    
top_curses = log.curse.value_counts()
top_curses









    Out[11]:





The F-word    289389
The S-word    202648
The A-word    122837
The D-word     72415
Other          35923
Name: curse, dtype: int64

matplotlib

Visualization

Plot diagram

tell Jupyter with %matplotlib inline to display generated graphics directly in the notebook.
create a diagram of the Series top_curses with plot().



In [12]:

    
%matplotlib inline
top_curses.plot()









    Out[12]:





<matplotlib.axes._subplots.AxesSubplot at 0x207dc268588>

Create a bar chart

call the bar() sub-method of plot for the data in log.



In [13]:

    
top_curses.plot.bar()









    Out[13]:





<matplotlib.axes._subplots.AxesSubplot at 0x207dc3341d0>

Improve the output

add a ; to the call above and re-execute it. What has changed?



In [14]:

    
top_curses.plot.bar();

Create a pie chart

call the pie() sub-method of plot for the data in top_curses.



In [15]:

    
top_curses.plot.pie();

Create a beautiful pie chart

create another pie chart, this time directly with plot() of the Series top_cursess and with the following parameters:
- kind="pie"
- figsize=[7,7]
- title="Top curses"
- label=""

Tip: Use auto completion.



In [16]:

    
top_curses.plot(
    kind='pie',
    title="Top curses",
    label="",
    figsize=[5,5]);

Time series analysis

View timestamp column

display the first five entries of the series timestamp.



In [17]:

    
log.timestamp.head()









    Out[17]:





0    2132-12-31 14:47:43
1    2132-12-31 13:13:56
2    2132-12-31 13:03:05
3    2132-12-31 12:30:34
4    2132-12-31 12:29:02
Name: timestamp, dtype: object

Data type conversion

use the pandas function pd.to_datetime to convert the column timestamp into a real date data type.
write the result into the new variable ts (abbreviation for "timestamp").
output the first five entries.



In [18]:

    
ts = pd.to_datetime(log.timestamp)
ts.head()









    Out[18]:





0   2132-12-31 14:47:43
1   2132-12-31 13:13:56
2   2132-12-31 13:03:05
3   2132-12-31 12:30:34
4   2132-12-31 12:29:02
Name: timestamp, dtype: datetime64[ns]

Assigning data to a Series

override the values of the timestamp Series with the data in ts.
print the first entries.



In [19]:

    
log['timestamp'] = ts
log.head()









    Out[19]:







  
    
      
      timestamp
      curse
      profile_id
    
  
  
    
      0
      2132-12-31 14:47:43
      The A-word
      0
    
    
      1
      2132-12-31 13:13:56
      The F-word
      0
    
    
      2
      2132-12-31 13:03:05
      The S-word
      0
    
    
      3
      2132-12-31 12:30:34
      The F-word
      0
    
    
      4
      2132-12-31 12:29:02
      The S-word
      0

Working with hourly data

access the date object dt of the Series timestamp.
inspect the hours of the hour property of the dt object.
store the hours into the new Series hour of the log DataFrame.
print out the first five entries.



In [20]:

    
log['hour'] = log.timestamp.dt.hour
log.head()









    Out[20]:







  
    
      
      timestamp
      curse
      profile_id
      hour
    
  
  
    
      0
      2132-12-31 14:47:43
      The A-word
      0
      14
    
    
      1
      2132-12-31 13:13:56
      The F-word
      0
      13
    
    
      2
      2132-12-31 13:03:05
      The S-word
      0
      13
    
    
      3
      2132-12-31 12:30:34
      The F-word
      0
      12
    
    
      4
      2132-12-31 12:29:02
      The S-word
      0
      12

Find out favorite cursing times

sum up the number of curses per each hour.
1. here, switch off the sorting with the parameter sort=False.
save the result in curses_per_hour.



In [21]:

    
curses_per_hour = log.hour.value_counts(sort=False)
curses_per_hour.head()









    Out[21]:





0    19533
1    15044
2    10420
3     7000
4     6068
Name: hour, dtype: int64

Visualize the hourly cursing result

plot a bar chart of the hourly cursing counts.



In [22]:

    
curses_per_hour.plot.bar();

Merging data sets

Now it's time to find out, which users are cursing the most. We have another data set in the Excel file profiles.xlsx with the following content:

The columns contain this information for all FUTURE 2000 users. It includes the unique identification number of a user's profile (matches profile_id in the curse log) as well as the name, birth date, sex and current job.

We combine this data with our log DataFrame to check off the remaining items on our to-do list:

Find out

which people are in the TOP 10 cursers' list
if men curse more often than woman
the favorite curse for each job So let's do it!

Read in Excel files

use the read_excel method of Pandas to read in the file profiles.xlsx into the DataFrame / variable profiles.
display the first five rows of the DataFrame.



In [23]:

    
profiles = pd.read_excel("../dataset/profiles.xlsx")
profiles.head()









    Out[23]:







  
    
      
      name
      birthdate
      sex
      job
    
  
  
    
      0
      Theresa Nichols
      2095-11-22
      F
      Brewing technologist
    
    
      1
      Monica Sanchez
      2098-05-05
      F
      Administrator (Civil Service)
    
    
      2
      Lori Briggs
      2082-07-06
      F
      Artist
    
    
      3
      Karen Marks
      2063-01-10
      F
      Teacher (adult education)
    
    
      4
      Gordon Phillips
      2056-11-01
      M
      Therapist (art)

Join datasets

use the method join() on the log DataFrame
1. as first argument, put in the DataFrame profiles
2. as second argument, add the parameter on='profile_id' to join log's profile_id column with the id (=index) column of the profiles data set.
store the result into the variable curse_profiles.
display the first entries of curse_profiles.



In [24]:

    
curse_profiles = log.join(profiles, on='profile_id')
curse_profiles.head()









    Out[24]:







  
    
      
      timestamp
      curse
      profile_id
      hour
      name
      birthdate
      sex
      job
    
  
  
    
      0
      2132-12-31 14:47:43
      The A-word
      0
      14
      Theresa Nichols
      2095-11-22
      F
      Brewing technologist
    
    
      1
      2132-12-31 13:13:56
      The F-word
      0
      13
      Theresa Nichols
      2095-11-22
      F
      Brewing technologist
    
    
      2
      2132-12-31 13:03:05
      The S-word
      0
      13
      Theresa Nichols
      2095-11-22
      F
      Brewing technologist
    
    
      3
      2132-12-31 12:30:34
      The F-word
      0
      12
      Theresa Nichols
      2095-11-22
      F
      Brewing technologist
    
    
      4
      2132-12-31 12:29:02
      The S-word
      0
      12
      Theresa Nichols
      2095-11-22
      F
      Brewing technologist

TOP 10 cursers

Find the users with the most uttered curses

Count the top 10 curses



In [25]:

    
curse_profiles.name.value_counts().head(10)









    Out[25]:





Theresa Nichols       24259
Henry Alvarez          9563
Nicole Weiss           6917
William Johnson        6653
Stephanie Sullivan     6293
Deborah Stephenson     6064
Jerry Reynolds         5942
Pamela Taylor          5463
Jason Orr              5384
Ann Gibson             4687
Name: name, dtype: int64

Cursing genders

Get the ratio between male to all curses in percent

select only men (value 'M') with the selector notation <DataFrame>[<DataFrame>.<Series> == <value>]
count the number of returned rows of the Series sex.
divide that number by all entries of the Series sex of the DataFrame curse_profiles.



In [26]:

    
curse_profiles[curse_profiles.sex == 'M'].sex.count() / curse_profiles.sex.count()









    Out[26]:





0.48295935355054948

Favorite curse per job (advanced level)

Grouping data

group together the curse_profiles' data along job and curse by using groupby and the list ['job', 'curse'] as argument.
count the values for the Series sex (or any other left Series).
store the returned Series into the variable job_curses.
display the first 10 entries of the Series.



In [27]:

    
job_curses = curse_profiles.groupby(['job', 'curse']).sex.count()
job_curses.head(10)









    Out[27]:





job                    curse     
Academic librarian     Other          39
                       The A-word    107
                       The D-word     67
                       The F-word    284
                       The S-word    194
Accommodation manager  Other          21
                       The A-word     54
                       The D-word     39
                       The F-word    133
                       The S-word     89
Name: sex, dtype: int64

Find the maximum per group

group job_curses again along job.
use the transform method with the argument 'max'.
store the result in max_per_group.
print the first rows of the result.



In [28]:

    
max_per_job = job_curses.groupby('job').transform('max')
max_per_job.head()









    Out[28]:





job                 curse     
Academic librarian  Other         284
                    The A-word    284
                    The D-word    284
                    The F-word    284
                    The S-word    284
Name: sex, dtype: int64

Filter maximum group values

filter with a selector the max values per group.
store the result in favorite_curses_per_job.
print the first rows.



In [29]:

    
favorite_curses_per_job = job_curses[job_curses == max_per_job]
favorite_curses_per_job.head()









    Out[29]:





job                                    curse     
Academic librarian                     The F-word    284
Accommodation manager                  The F-word    133
Accountant (chartered certified)       The F-word    264
Accountant (chartered management)      The F-word    166
Accountant (chartered public finance)  The F-word    210
Name: sex, dtype: int64

Count favorite curses

use reset_index() on favorite_curses_per_job to get rid of the grouping index.
count the occuring values for the curse Series



In [30]:

    
favorite_curses_per_job.reset_index().curse.value_counts()









    Out[30]:





The F-word    637
The S-word      3
Name: curse, dtype: int64

Unstack Series

use unstack() on favorite_curses_per_job
store the result in favorite_curse_words



In [31]:

    
favorite_curse_words = favorite_curses_per_job.unstack()
favorite_curse_words.head()









    Out[31]:







  
    
      curse
      The F-word
      The S-word
    
    
      job
      
      
    
  
  
    
      Academic librarian
      284.0
      NaN
    
    
      Accommodation manager
      133.0
      NaN
    
    
      Accountant (chartered certified)
      264.0
      NaN
    
    
      Accountant (chartered management)
      166.0
      NaN
    
    
      Accountant (chartered public finance)
      210.0
      NaN

Identify outliers

Show the jobs in favorite_curse_words that have the curse word 'The S-word' as favorite.



In [32]:

    
favorite_curse_words[~favorite_curse_words['The S-word'].isnull()]









    Out[32]:







  
    
      curse
      The F-word
      The S-word
    
    
      job
      
      
    
  
  
    
      Cartographer
      45.0
      45.0
    
    
      Podiatrist
      NaN
      29.0
    
    
      Speech and language therapist
      NaN
      51.0

The end

If you read this: Well done! You did it!

What's missing

You have now learned some basics about pandas. This will get us a long way in our daily work. The other important topics that are still missing are:

reading in complicated, semi-structured data structures
cleansing of poor data
merging different data sources with merge
transforming of DataFrames with pivot_table.

Summary

I hope that this mini-tutorial will show you the potential of data analysis using Jupyter, Python, pandas and matplotlib!

I am looking forward to your comments and feedback!

Contact

Markus Harrer
Blog: https://www.feststelltaste.de
Mail: talk@markusharrer.de
Twitter: @feststelltaste
Consulting and training: http://markusharrer.de

	timestamp	curse
0	2132-12-31 14:47:43	The A-word
1	2132-12-31 13:13:56	The F-word
2	2132-12-31 13:03:05	The S-word
3	2132-12-31 12:30:34	The F-word
4	2132-12-31 12:29:02	The S-word

	name	birthdate	sex	job
0	Theresa Nichols	2095-11-22	F	Brewing technologist
1	Monica Sanchez	2098-05-05	F	Administrator (Civil Service)
2	Lori Briggs	2082-07-06	F	Artist
3	Karen Marks	2063-01-10	F	Teacher (adult education)
4	Gordon Phillips	2056-11-01	M	Therapist (art)

curse	The F-word	The S-word
job
Academic librarian	284.0	NaN
Accommodation manager	133.0	NaN
Accountant (chartered certified)	264.0	NaN
Accountant (chartered management)	166.0	NaN
Accountant (chartered public finance)	210.0	NaN

curse	The F-word	The S-word
job
Cartographer	45.0	45.0
Podiatrist	NaN	29.0
Speech and language therapist	NaN	51.0